PubChem: atom environments for molecule standardization
Authors
Abstract
PubChem is an open repository for molecular structures, their properties and biological activities [1]. The number of deposited structures has increased steadily since its creation in 2004. Today, it contains more than 92 million substances (PubChem Substance) with 32 million unique small molecules (PubChem Compound). Consequently, it is not feasible to inspect every structure visually and correct errors by hand in order to detect structure equivalencies and ensure data quality. Efficient and reliable automated standardization methods are needed during registration to compensate for alternative representations of, as well as errors and artifacts in, (sub)structures. These arise from diverging business rules, personal preferences, data format conversion, disagreements between aromaticity definitions and automated library generation. At PubChem, we are developing a new standardization approach based on rules for atom environment transformation. These rules are obtained from a statistical analysis of the atom environment transformations observed with a reference workflow that combines chemical reasonability checks, valence filters, canonical tautomer determination and aromaticity normalization. Additional atom environment mappings are provided by hand curation. In the first application of our technique to PubChem, we concentrate on purely organic compounds, which represent 97% of the deposited structures and also account for the majority of atom environments. Here, we present the first results obtained with our approach, highlighting the methodology, challenges, benefits and future possibilities.
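To make the rule-derivation idea concrete, the following is a minimal sketch, not the PubChem implementation. It assumes that each atom environment can be encoded as a string token and that the reference workflow yields (input environment, output environment) pairs. The function names, the thresholds `min_count` and `min_fraction`, and the example tokens are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def derive_rules(observed_pairs, min_count=2, min_fraction=0.8, curated=None):
    """Derive environment -> environment rules from observed transformations.

    observed_pairs: iterable of (env_in, env_out) string tokens, assumed to
    come from running a reference standardization workflow (hypothetical input).
    A rule is kept only if its most frequent outcome is seen at least
    min_count times and covers at least min_fraction of all outcomes.
    curated: optional dict of hand-curated mappings; these take precedence.
    """
    counts = defaultdict(Counter)
    for env_in, env_out in observed_pairs:
        counts[env_in][env_out] += 1

    rules = {}
    for env_in, outcomes in counts.items():
        env_out, n = outcomes.most_common(1)[0]
        total = sum(outcomes.values())
        if n >= min_count and n / total >= min_fraction:
            rules[env_in] = env_out

    # Hand-curated mappings are applied last so that they override statistics.
    rules.update(curated or {})
    return rules


def standardize(environments, rules):
    """Map each environment through the rules; unknown ones pass unchanged."""
    return [rules.get(env, env) for env in environments]


if __name__ == "__main__":
    # Illustrative tokens only; not PubChem's actual environment encoding.
    pairs = (
        [("N(=O)(=O)C", "[N+](=O)([O-])C")] * 9
        + [("N(=O)(=O)C", "N(=O)(=O)C")] * 1
        + [("rare-env", "other-env")]  # seen once, below min_count
    )
    rules = derive_rules(pairs, curated={"S(=O)C": "[S+]([O-])C"})
    print(standardize(["N(=O)(=O)C", "rare-env", "S(=O)C"], rules))
```

Keeping the statistically derived rules and the curated mappings in a single lookup table mirrors the combination described in the abstract; how environments are actually encoded and which thresholds are appropriate are left open here.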
Similar resources
PubChem atom environments
BACKGROUND Atom environments and fragments find wide-spread use in chemical information and cheminformatics. They are the basis of prediction models, an integral part in similarity searching, and employed in structure search techniques. Most of these methods were developed and evaluated on the relatively small sets of chemical structures available at the time. An analysis of fragment distributi...
PubChem3D: a new resource for scientists
BACKGROUND PubChem is an open repository for small molecules and their experimental biological activity. PubChem integrates and provides search, retrieval, visualization, analysis, and programmatic access tools in an effort to maximize the utility of contributed information. There are many diverse chemical structures with similar biological efficacies against targets available in PubChem that a...
Expanding the fragrance chemical space for virtual screening
The properties of fragrance molecules in the public databases SuperScent and Flavornet were analyzed to define a "fragrance-like" (FL) property range (Heavy Atom Count ≤ 21, only C, H, O, S, (O + S) ≤ 3, Hydrogen Bond Donor ≤ 1) and the corresponding chemical space including FL molecules from PubChem (NIH repository of molecules), ChEMBL (bioactive molecules), ZINC (drug-like molecules), and GD...
NCBI PubChem BioAssay Database
NCBI’s PubChem BioAssay database (1-5) (http://pubchem.ncbi.nlm.nih.gov) is a public repository for archiving biological tests of small molecules and siRNA reagents. Small molecule bioactivity data contained in the BioAssay database consist of information generated through high-throughput screening experiments, medicinal chemistry studies, chemical biology research, as well as literature curati...
PubChem: Integrated Platform of Small Molecules and Biological Activities
PubChem is an open repository for experimental data identifying the biological activities of small molecules. PubChem contents include more than: 1,000 bioassays, 28 million bioassay test outcomes, 40 million substance contributed descriptions, and 19 million unique compound structures contributed from over 70 depositing organizations. PubChem provides a significant, publicly accessible platfor...
Journal title:
Volume 5, issue
Pages -
Publication date: 2013